A place for everything
A vector is a storage container for data of a single, uniform type.
Every element in a vector has the same data type, and each element can be accessed using a numerical index enclosed in square brackets, [ and ].
Because x is a vector AND it contains numeric data, the introspection operators for both vector and numeric will return TRUE.
The data in x ARE both vectors and numeric types.
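As a minimal sketch of the point above (the values stored in x here are made up for illustration and are not the original example data), the introspection functions is.vector() and is.numeric() can be checked directly:

```r
# Hypothetical values; any numeric vector behaves the same way
x <- c(2, 4, 6, 8)

is.vector(x)    # is x a vector?
is.numeric(x)   # does x hold numeric data?

# Access single elements by numerical index with square brackets
first  <- x[1]
fourth <- x[4]
```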
As long as every element has exactly the same base data type, a vector will work as expected.
You CANNOT mix data types in a single vector and have each value keep its original type. R will coerce the values to a least common data type so that they are all of the same type.
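A short sketch of coercion, using made-up values: mixing numbers, text, and logicals yields a character vector, while mixing numbers and logicals yields a numeric vector.

```r
# Numeric, character, and logical together: everything becomes character
mixed <- c(1, "two", TRUE)
class(mixed)

# Numeric and logical together: TRUE becomes 1 and FALSE becomes 0
nums <- c(1, TRUE, FALSE)
class(nums)
```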
Sometimes it is helpful to make a sequence of values in a vector. R has some built-in functionality for that.
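Two common ways to build sequences are the colon operator and seq(). This is only a sketch; the end points and step sizes are arbitrary choices.

```r
# The colon operator counts by ones
one_to_five <- 1:5

# seq() with a fixed step size
by_quarters <- seq(0, 1, by = 0.25)

# seq() with a fixed number of evenly spaced values
five_values <- seq(10, 50, length.out = 5)
```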
Data within vectors can be subjected to unary operators.
As well as binary operators.
If you attempt to apply a binary operator to two vectors whose lengths are different, R will recycle the values in the shorter one.
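The following sketch, with made-up values, shows a unary operator, a binary operator on equal-length vectors, and recycling of a shorter vector.

```r
x <- c(1, 2, 3, 4)

# Unary operator: negate every element
negated <- -x

# Binary operator on two vectors of the same length (element by element)
summed <- x + c(10, 20, 30, 40)

# The shorter vector is recycled: 10, 20, 10, 20
recycled <- x + c(10, 20)
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.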
For a random and entirely made-up example of a homework assignment, the participants’ raw scores were recorded as: 32, 31, 45, 29, 17, 40, 26, and 23. This was out of 45 total points. In R, do the following:
min(), sum(), max(), and mean() to derive these mathematical properties.
For some mathematical operations, we need to work with matrices. These are another ‘general’ container but with dimensions for rows and columns of data.
Matrices are filled column-wise by default; if you want them filled row-wise, you have to ask for it.
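A small sketch of the two fill orders, using the numbers 1 to 6 as stand-in data:

```r
# Default: values fill down the columns
m_col <- matrix(1:6, nrow = 2)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# byrow = TRUE: values fill across the rows
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
```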
Just like vectors, the square brackets are used to access values within a matrix. However, there are now two indices, one for the row and one for the column.
You can get an entire row or column using what is called a slice index.
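As an illustration with a made-up 2x3 matrix, a single value takes both a row and a column index, and leaving one index blank returns the whole row or column:

```r
m <- matrix(1:6, nrow = 2, byrow = TRUE)

# Single value: row 2, column 3
one_value <- m[2, 3]

# Slices: leave an index empty to take everything along that dimension
first_row     <- m[1, ]
second_column <- m[, 2]
```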
Arithmetic operators on matrices work the same way (as long as the matrices have matching numbers of rows and columns).
This is element-wise multiplication (also known as the Hadamard product; the Kronecker product is a different operation).
Matrix multiplication is a slightly more involved operation.
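To make the difference concrete, here is a sketch with two made-up 2x2 matrices. The * operator multiplies element by element, while %*% performs matrix multiplication (rows of the first matrix times columns of the second).

```r
A <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)
B <- matrix(5:8, nrow = 2)   # columns are (5, 6) and (7, 8)

# Element-wise multiplication
elementwise <- A * B

# Matrix multiplication
matprod <- A %*% B
```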
For most of you, this will be the only time you’ll be working with matrices (so soak in the glory of the moment; it is all non-matrix R reality from here on out!). Using a sequence of numbers from 20 to 42:
byrow does to a 4x5 matrix (e.g., try both byrow=TRUE and byrow=FALSE).
Lists are more versatile containers in that they allow you to store different kinds of data in them.
By default, they are numerically indexed.
Notice that lists use two sets of square brackets instead of one, which differentiates them from a normal vector.
This is because, technically, a single set of brackets returns a list that holds the first element, and what we are trying to get is the first element inside that contained list.
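A minimal sketch of the difference between single and double brackets, using a made-up list:

```r
my_list <- list("A", 42, TRUE)

# Single brackets return a (smaller) list
class(my_list[1])

# Double brackets return the element itself
class(my_list[[1]])

# Only the extracted element can be used in arithmetic
answer <- my_list[[2]] + 1
```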
Lists can be made more friendly to you by using actual names for the keys associated with each value. In some languages, like Python, these are referred to as dictionaries.
Notice the use of the $ in the output.
This $ notation is used to easily grab the contents of the list at that slot.
As well as to add new entries to the list directly.
You can also use the double brackets AND the name of the key as a reference.
However, this is even more work and looks a bit less elegant than the $ notation. Also, if you look at the order of operations, you’ll see that the $ notation has a higher precedence than the single or double brackets (see ?Syntax).
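The named-list ideas above can be sketched as follows. The list name person and its keys are hypothetical and chosen only for illustration.

```r
# A list with named keys
person <- list(name = "Alice", score = 0.95)

# Grab a slot with the $ notation
who <- person$name

# Add a new entry directly with $
person$passed <- TRUE

# Double brackets with the key name reach the same contents
same_score <- person[["score"]]

names(person)
```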
In R, you will most likely work with list objects as analysis results rather than as a container to keep your data. Almost all analyses return their values as a list with the included components. Here is an example.
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
Here is a quick correlation between the sepal and petal lengths in the iris data set.
Pearson's product-moment correlation
data: iris$Sepal.Length and iris$Petal.Length
t = 21.646, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8270363 0.9055080
sample estimates:
cor
0.8717538
Values
statistic.t 21.6460193457598
parameter.df 148
p.value 1.03866741944978e-47
estimate.cor 0.871753775886583
null.value.correlation 0
alternative two.sided
method Pearson's product-moment correlation
data.name iris$Sepal.Length and iris$Petal.Length
conf.int1 0.827036329664362
conf.int2 0.905508048821454
Printing the results shows the components of the analysis in a way that makes sense because, while it is a list, it also has a class of its own.
Pearson's product-moment correlation
data: iris$Sepal.Length and iris$Petal.Length
t = 21.646, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8270363 0.9055080
sample estimates:
cor
0.8717538
[1] "htest"
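The output above can be reproduced with a sketch like the following. The call is inferred from the data line in the printed result, and iris.test is the object name used in the text; the rounding at the end is an arbitrary choice.

```r
# Correlation test on the built-in iris data
iris.test <- cor.test(iris$Sepal.Length, iris$Petal.Length)

# It is a list underneath, with its own class that controls printing
is.list(iris.test)
class(iris.test)

# The components are reachable by name, like any other list
names(iris.test)
r_value <- round(iris.test$estimate, 3)
p_value <- iris.test$p.value
```

Individual components such as iris.test$estimate or iris.test$p.value are the pieces that can then be dropped into the text of a report.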
This is awesome because it makes it much easier to use something like the values stored in iris.test to insert the data from our analyses directly (inline) into our text.
There was a significant relationship between sepal and petal length (Pearson’s product-moment correlation, \(\rho =\) 0.872, \(t =\) 21.6, P = 1.04e-47).
Do a quick analysis of width for sepals and petals in the three iris species.
The main container that will hold almost all of your data is the data.frame.
weight, longitude, survived)
Let’s consider the following data as individual vectors.
These can be put into a data.frame as:
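A sketch of this step, with the values taken from the table printed further down. The object name df is an assumption, since the original name is not shown.

```r
# Individual vectors, one per column
names      <- c("Bob", "Alice", "Jane", "Norm")
homework.1 <- c(0.78, 0.95, 0.82, NA)
homework.2 <- c(0.85, 0.89, 0.92, 0.79)

# Combine them; each vector becomes a column named after the vector
df <- data.frame(names, homework.1, homework.2)
df
```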
Each column in a data.frame is a self-contained set of data all of the same type and as such can be summarized.
Just like in a list, the columns of a data.frame are accessed by their names, and we can use the $ notation.
The easiest way to index values in a data.frame is to use the $ notation to grab the column (as a vector object) and then to use the square brackets to access a specific element.
You can also use the numerical indices for both row and column in the data.frame (n.b., it is row first then column).
  names homework.1 homework.2
1   Bob       0.78       0.85
2 Alice       0.95       0.89
3  Jane       0.82       0.92
4  Norm         NA       0.79
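The two ways of indexing described above can be sketched like this. The data.frame is rebuilt here so the example stands alone, and the name df is an assumption.

```r
df <- data.frame(
  names      = c("Bob", "Alice", "Jane", "Norm"),
  homework.1 = c(0.78, 0.95, 0.82, NA),
  homework.2 = c(0.85, 0.89, 0.92, 0.79)
)

# Column by name, then element by position
jane_hw2 <- df$homework.2[3]

# Row first, then column
alice_hw2 <- df[2, 3]

# A column is a vector, so it can be summarized (NA is skipped on request)
mean_hw1 <- mean(df$homework.1, na.rm = TRUE)
```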
The dimensions of a data.frame (the number of rows and columns it contains) are then relevant.
You will almost never create data.frame objects de novo but instead load data in from some external resource. There are several functions that simplify this within tidyverse so let’s make sure we have it loaded into memory.
Here is a CSV file that is contained in this repository. Since it is a public repository, we can access it from within GitHub using a URL.
Remember René Magritte’s Pipe? I used this for a reason:
We use the term “pipe” in the sense of making a connection of data flows from one step to the next.
\[ Load\;Data \to Format\;Dates \to Scale\;Values \to Make\;Plot \]
Originally, there was a library named magrittr that defined one of these operators, %>%. It replaces the usual way of making a function call, function( data ).
Instead, we can take the beetles object and pipe it (i.e., pass its values) into the summary function (as the first argument that summary receives).
In fact, we could get rather expressive with this kind of piping and use built-in indentation rules to make the code significantly more readable.
Compare the following code that summarizes the first 10 entries in the beetles data set.
In fact, this became so popular that the R language gurus decided to make a pipe operator, |>, that does not need the magrittr library at all (and is only 2 characters in length). You will see both of these operators in action.
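A sketch of the built-in pipe, which requires R 4.1.0 or newer. Because the beetles data are loaded from an external file, the built-in iris data are used here as a stand-in.

```r
# Nested form: read from the inside out
nested <- summary(head(iris, 10))

# Piped form: read from top to bottom, one step per line
piped <- iris |>
  head(10) |>
  summary()

identical(nested, piped)
```

The magrittr operator is used the same way, with %>% in place of |>, once that library has been loaded.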
From the beetle data, how would you estimate the centroid coordinate of the data set?
There are many built-in data sets that we can play with. Let’s copy one of these and then practice adding to and deleting from it.
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
mpg cyl disp hp
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
Median :19.20 Median :6.000 Median :196.3 Median :123.0
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
Median :3.695 Median :3.325 Median :17.71 Median :0.0000
Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
am gear carb
Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :0.0000 Median :4.000 Median :2.000
Mean :0.4062 Mean :3.688 Mean :2.812
3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :1.0000 Max. :5.000 Max. :8.000
Some of the data are continuous
[1] 3.90 3.90 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 3.92 3.07 3.07 3.07 2.93
[16] 3.00 3.23 4.08 4.93 4.22 3.70 2.76 3.15 3.73 3.08 4.08 4.43 3.77 4.22 3.62
[31] 3.54 4.11
We can delete a column from a data.frame object by assigning NULL to it.
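A minimal sketch of deleting a column. The name cars for the copy is an assumption, and cyl is just one of the columns that is absent from the later output.

```r
# Work on a copy so the built-in data set is left alone
cars <- mtcars

# Assigning NULL removes the column
cars$cyl <- NULL

"cyl" %in% names(cars)
ncol(cars)
```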
To add a row to a data.frame, we really need to make a second data.frame with identical columns and then bind it onto the bottom of the first one.
The mtcars data set has the name of the car in the data but not as a column of data itself… This is an older way of doing it and one that is not commonly used any more.
To give dyerVW the name of the vehicle as its row name, just like the mtcars data, we use the function rownames().
OK, so we are ready to bind them (let’s verify they have the same columns).
We then bind it onto the original data set.
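The row-binding steps can be sketched as follows. The dyerVW values come from the table printed further down. The cars object is a stand-in built from the last four rows of mtcars, and its Dreamy column is filled in by hand to match the printed output because the original rule for it is not shown.

```r
# Stand-in for the working copy of the data (assumption)
cars <- mtcars[c("Ford Pantera L", "Ferrari Dino", "Maserati Bora", "Volvo 142E"),
               c("mpg", "disp", "hp", "qsec")]
cars$Dreamy <- c(TRUE, TRUE, TRUE, FALSE)

# A one-row data.frame with exactly the same columns
dyerVW <- data.frame(mpg = 21.4, disp = 91, hp = 53, qsec = 20.9, Dreamy = TRUE)
rownames(dyerVW) <- "Volkswagen Beetle"

# Verify the columns match, then bind the new row onto the bottom
same_columns <- identical(names(cars), names(dyerVW))
cars <- rbind(cars, dyerVW)
```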
                mpg disp  hp qsec Dreamy
Ford Pantera L 15.8  351 264 14.5   TRUE
Ferrari Dino   19.7  145 175 15.5   TRUE
Maserati Bora  15.0  301 335 14.6   TRUE
Volvo 142E     21.4  121 109 18.6  FALSE
Why?
There are many times when we can use real names for columns of data. This is beneficial because when we plot the data or make a table, an abbreviated name like hp or qsec means we will have to fix the labels or find some other workaround.
Dyer’s Rule #1: Use informative names for your data.
Since we cannot have spaces in variable names (and the columns of a data.frame are just variables), we need to enclose a compound name in backticks so R recognizes it as a single entity instead of 2 or more variable names.
                   MPG Displacement Horse Power Quarter Mile Dream Car
Ford Pantera L    15.8          351         264         14.5      TRUE
Ferrari Dino      19.7          145         175         15.5      TRUE
Maserati Bora     15.0          301         335         14.6      TRUE
Volvo 142E        21.4          121         109         18.6     FALSE
Volkswagen Beetle 21.4           91          53         20.9      TRUE
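A sketch of renaming columns and then reaching a column whose name contains a space. The cars object is a stand-in built from four mtcars rows, with Dreamy filled in by hand as an assumption.

```r
cars <- mtcars[c("Ford Pantera L", "Ferrari Dino", "Maserati Bora", "Volvo 142E"),
               c("mpg", "disp", "hp", "qsec")]
cars$Dreamy <- c(TRUE, TRUE, TRUE, FALSE)

# Replace the abbreviated names with informative ones
names(cars) <- c("MPG", "Displacement", "Horse Power", "Quarter Mile", "Dream Car")

# Backticks keep a compound name together when using $
horse_power <- cars$`Horse Power`

# With double brackets, an ordinary quoted string is enough
quarter_mile <- cars[["Quarter Mile"]]
```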
Occasionally, you’ll need to sort a data.frame to get some inference out of it (e.g., slowest Quarter Mile or best MPG). We can use the arrange() function to easily do this (it is actually from dplyr, which we will be diving into next week).
To sort in reverse, we put a minus sign in front of the column name to indicate sorting in decreasing order.
We can sort the whole data.frame using multiple columns by adding them to the call as additional arguments (n.b., a logical sorts by numerical value with FALSE == 0 and TRUE > 0. Suck it, Volvo!!!).
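The three kinds of sorting can be sketched as follows, assuming dplyr is installed. The data.frame is rebuilt by hand from the table above (only three of its columns, for brevity), and the name cars is an assumption.

```r
library(dplyr)

cars <- data.frame(
  MPG            = c(15.8, 19.7, 15.0, 21.4, 21.4),
  `Quarter Mile` = c(14.5, 15.5, 14.6, 18.6, 20.9),
  `Dream Car`    = c(TRUE, TRUE, TRUE, FALSE, TRUE),
  row.names      = c("Ford Pantera L", "Ferrari Dino", "Maserati Bora",
                     "Volvo 142E", "Volkswagen Beetle"),
  check.names    = FALSE   # keep the spaces in the column names
)

# Increasing order of one column
by_quarter <- arrange(cars, `Quarter Mile`)

# Decreasing order: put a minus sign in front of the column
by_mpg_desc <- arrange(cars, -MPG)

# Several columns: FALSE sorts before TRUE, then ties are broken by MPG
by_dream_then_mpg <- arrange(cars, `Dream Car`, -MPG)
```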